Abstract

The four main approaches to measuring treatment effects in schools
(raw gain, residual gain, covariance, and true scores) were compared. A
simulation study showed that true score analysis produced a large number of
Type-I errors. When corrected for this error, the method showed the
least power of the four. This outcome was clearly the result of the
computational method, which adds dependent variable information into the
independent variable to form the true score. Covariance analysis was
recommended, with reservation, as the method of choice.

Analysis of Covariance: Is It the Appropriate
Model to Study Change?

Paul T. Marston and Gary D. Borich

In many testing situations, it is found that the individual differences
are large relative to the size of the treatment effects being studied. What
is needed is a method to compensate for the effects of these individual
differences on the outcome measure so that the effect of the treatment can
be accurately assessed. A number of statistical models have been designed
to make this type of adjustment, usually by measuring change. Four change
models are considered in this paper: (a) analysis of difference scores, (b)
analysis of residual gains, (c) analysis of covariance, and (d) analysis of
covariance with true score adjustment. The assumptions underlying these models
are examined and then the results of a Monte Carlo simulation study comparing
the four methods are reported.

All of these methods start with the assumption that an individual's
posttest score, $y$, can be thought of as a linear combination of a number of
factors including the initial level of performance. In the case of the single
treatment being considered here, the even stronger assumption is made that the
only important factors are the effect of the treatment, $a$, and the individual's
pretest performance, $x$. The critical differences between these models involve
the assumptions made about the relationship between the pretest score and
posttest score. All four models assume that the relationship is linear and
is the same for all individuals.

The method which makes the strictest assumptions is the analysis of
difference scores. In it the posttest score is thought to consist of a treatment
effect plus the individual's pretest score. In other words, everyone
in a specific treatment group is expected to change by the same amount. The
usual approach in making an analysis of difference scores is to form a gain
score by subtracting each individual's pretest from their posttest score.
Then a fixed effect analysis of variance or a t-test is made on these gain
scores. The two trial, mixed model analysis of variance is also used for
pretest-posttest designs, but as Huck and McLean (1975) have pointed out, this
analysis of variance is formally identical to the analysis of difference scores.

The assumption of a constant change for every individual in a group does
not appear to be a reasonable one for many situations. When posttest scores
are plotted as a function of their respective pretest values, it is usual to
observe regression toward the mean. That is, an individual with a high initial
score is more likely to score lower on the posttest and, conversely, an
individual with a low pretest score is likely to raise their score. This means
that the amount of change is a function of the initial level of performance when
everything else is held constant. If the regression of the posttest scores
on the pretest is reasonably linear then an estimate of the posttest score can
be made by using a simple correlation model. Gain scores are then formed by
subtracting an individual's estimated posttest score (formed using the
regression equation with pretest) from their actual posttest score. Treatment
effects would then show up as different mean gains for the various
treatment conditions. Such a procedure is called residual gain analysis. As
a technique, it gets around the assumption of equal gains for all individuals
regardless of initial score used in the analysis of difference scores, while
still retaining the type of computation and interpretation found in the latter
type of analysis. One should bear in mind that the average gain across all
groups will be zero in the residualized gain method.
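
The two kinds of gain score described so far can be sketched in a few lines of
Python. This is only an illustration: the function names, the seed, and the
simulated scores are assumptions made for the example and are not taken from
the study. It forms difference scores and residual gains for one set of
pretest and posttest scores and checks that the residual gains average to zero.

```python
import random


def difference_scores(pre, post):
    """Raw gain: posttest minus pretest for each individual."""
    return [y - x for x, y in zip(pre, post)]


def residual_gains(pre, post):
    """Posttest minus the posttest predicted from the pretest by a single
    regression line fitted to all individuals, ignoring group membership."""
    n = len(pre)
    mean_x = sum(pre) / n
    mean_y = sum(post) / n
    sxx = sum((x - mean_x) ** 2 for x in pre)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pre, post))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]


# Hypothetical scores: 50 individuals, pretest-posttest slope of .6.
rng = random.Random(1)  # arbitrary seed, chosen only for repeatability
pre = [rng.gauss(0.0, 1.0) for _ in range(50)]
post = [0.6 * x + 0.8 * rng.gauss(0.0, 1.0) for x in pre]

gains = difference_scores(pre, post)
resid = residual_gains(pre, post)
mean_resid = sum(resid) / len(resid)
print(mean_resid)  # zero apart from rounding error
```

Because the regression line is fitted by least squares with an intercept, the
residuals sum to zero by construction, which is why only differences between
group means of the residual gains carry information about treatment effects.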

In some ways residualized gain is very similar to analysis of
covariance, which is examined next. Both share the same
underlying linear model for estimating the posttest scores:

$$y = b_0 + b_1 a + b_2 x + e \tag{1}$$

where $a$ and $x$ are the treatment and pretest effects, $e$ is a random error, and
the $b$'s are weighting coefficients. What distinguishes the types of analyses
is how the estimates of the weighting coefficients are obtained. The analysis
of covariance makes the fewest assumptions by allowing all three $b$'s to be
fitted from the data. Werts and Linn (1970) have shown that in the residualized
gain analysis the value of $b_2$ is assumed to be the same as what would have been
obtained if the $a$ term was not included in the model. They also show that in
the difference score analysis the assumption is that the value of $b_2$ is equal
to 1.0. So it is clear that the difference score analysis also assumes Model (1).
One can thus think of the three analyses as putting progressively less restrictive
assumptions on the same theoretical model. For a given data set, analysis of
covariance can never give a worse representation of the relationships than the
other two and it may often be better. This is because the least-squares solution
for the $b$'s involves the interrelationship of all the variables. For example,
the value of $b_2$ in the two group case depends on the correlations $r_{xy}$, $r_{ay}$, and
$r_{ax}$. In the residualized gain analysis the value of $b_2$ can only be a function
of the correlation $r_{xy}$. Werts and Linn show that the two methods will give
equivalent results only if there is no correlation between the covariate and
the treatment variable. Given the additional work of forming the residualized
gain scores, it is not clear why there would ever be a preference for what is
only an approximation to analysis of covariance.
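
One way to see the three analyses as restrictions on the same model is to note
that, for two groups, each one estimates the treatment difference as the
posttest mean difference minus a pretest weight times the pretest mean
difference. The sketch below assumes that reading; the function names and the
eight data points are hypothetical and chosen only so the arithmetic is easy
to follow.

```python
def mean(values):
    return sum(values) / len(values)


def cross_products(xs, ys):
    """Return (sum of squares of x, sum of cross products of x and y)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, sxy


def total_slope(groups):
    """Pretest weight fitted with the treatment term left out of the model
    (the weight used by residualized gain)."""
    xs = [x for g in groups for x in g[0]]
    ys = [y for g in groups for y in g[1]]
    sxx, sxy = cross_products(xs, ys)
    return sxy / sxx


def pooled_within_slope(groups):
    """Pretest weight fitted jointly with the treatment term
    (the weight used by analysis of covariance)."""
    sxx = sxy = 0.0
    for xs, ys in groups:
        gxx, gxy = cross_products(xs, ys)
        sxx += gxx
        sxy += gxy
    return sxy / sxx


def adjusted_difference(groups, b2):
    """Group difference on the posttest after removing b2 times the
    group difference on the pretest."""
    (x1, y1), (x2, y2) = groups
    return (mean(y1) - mean(y2)) - b2 * (mean(x1) - mean(x2))


# Hypothetical data: (pretest scores, posttest scores) for two groups.
groups = [([1, 2, 3, 4], [2, 3, 5, 6]),
          ([2, 3, 4, 5], [1, 3, 4, 4])]

diff_score_effect = adjusted_difference(groups, 1.0)
residual_gain_effect = adjusted_difference(groups, total_slope(groups))
covariance_effect = adjusted_difference(groups, pooled_within_slope(groups))
print(diff_score_effect, residual_gain_effect, covariance_effect)
```

In this made-up example the groups differ on the pretest, so the covariate and
the treatment are correlated and the three estimates disagree. When the
pretest means are equal, the pretest weight drops out of the adjusted
difference and all three give the same value.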

As the assumptions on the theoretical model are progressively relaxed in
the three methods discussed so far, a better estimate of $b_2$ is obtained and
consequently a better estimate of the treatment effect, $b_1$, is also obtained.
If this line of reasoning is continued, it appears that when the analysis of
covariance does not give a good estimate of $b_2$ then an improved model should be
sought. This can happen when errors of measurement occur in obtaining the $x$
values. Such errors will lower the obtained correlation between the pretest
and posttest relative to that which would have been obtained if accurate
measurements had been used. When such accurate measurements in principle
cannot be obtained then the hypothetical accurate values the $x$'s represent
are called "true scores." The appropriate analysis using these scores is,
logically enough, called true score analysis of covariance or true score analysis
for short. Students of measurement theory have argued that when the covariate
contains error its correlations with the other variables should be corrected
for the unreliability before the analysis is done (Cronbach & Furby, 1970).
The "true" pretest-posttest relationship can then be obtained and therefore a
better estimate of the treatment effect is also obtained. Such corrections
are based on some reliability measure for the pretest such as a test-retest
intraclass correlation.

Up to this point there appears to be little disagreement in the literature
as to the merits of the first three models discussed. It is recognized that
the covariate may not relate to the dependent variable in a linear fashion or,
even if it does, this relationship might be a function of the treatment. Both
of these assumptions can and should be tested prior to making tests using the
analysis of covariance model (Draper & Smith, 1966). The assumption that a
correction should be made for an error of measurement in the covariate is far
from universally accepted. For example, writers of one textbook state that it
does not make any difference whether the independent variables are measured
with or without error (Draper & Smith) while another simply says it limits
one to making statistical inferences about the obtained scores (Graybill, 1961).
Lumsden (1976) has taken the position of ignoring the reliability question
altogether. He states that the true score itself can be considered as an
unreliable measure of some actual characteristic of the individual. For
example, one ought to be interested in the relationship between the obtained
scores on a math test and mathematical ability, not in the relationship between
obtained math scores and true math scores. In other words, why stop at
correction for test reliability? Why not correct for the error of measurement
between the test and the individual characteristic? The latter obviously
cannot be done, so why bother with the former? This substantive criticism
should be borne in mind when considering the other evidence about the true
score adjustment.

The treatment of corrections for an analysis of true scores is based on
a least-squares solution from an adjusted intercorrelation matrix (Cohen &
Cohen, 1975; Cronbach & Furby, 1970; Werts & Linn, 1970) so it is not always
clear what the estimated true scores would be. One can find out, however, by
using the adjusted correlation matrix to solve the equation for the true
score, $\hat{x}$:

    [equation (2) is missing in the source]

This solution gives

    [equation (3) is missing in the source]

If these new values are substituted for the original $x$'s in the data, one
finds the new correlation is

$$r_{\hat{x}y} = \frac{r_{xy}}{\sqrt{R_{xx}}} \tag{4}$$

which is the correction for unreliability. When a treatment effect is added
to equation (2) and solved, it is found that $\hat{x}$ is a function of both $y$ and $a$.
That result is somewhat complex and will not be included here. The key thing
to note is that in obtaining the true score values, one winds up using the
posttest to predict itself, a procedure with a certain amount of circularity.
In fact, if $r_{xy}^2$ is equal to or greater than the reliability, $R_{xx}$, then the true
scores will perfectly predict the posttest values. If this happens, then no
matter how large the treatment effect is, it must always be estimated as
zero in the true score analysis. This problem could be circumvented by switching
to a standard analysis of covariance whenever the magnitude of $r_{xy}^2$ is close to
the reliability correction. Now one only has the problem of deciding how close
is really close.
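
The boundary case can be illustrated numerically. The sketch below assumes
the correction takes the usual form of dividing the obtained pretest-posttest
correlation by the square root of the pretest reliability; the function names
are hypothetical. It shows how the corrected correlation approaches 1.0 as the
reliability used in the correction falls toward the squared correlation.

```python
import math


def corrected_correlation(r_xy, reliability):
    """Obtained correlation divided by the square root of the
    pretest reliability used in the correction."""
    return r_xy / math.sqrt(reliability)


def perfectly_predicts(r_xy, reliability):
    """True when the squared correlation reaches the reliability, so the
    corrected correlation is 1.0 or more in magnitude."""
    return r_xy ** 2 >= reliability


# An obtained correlation of .6 corrected with several reliabilities.
for reliability in (0.8, 0.6, 0.5, 0.3):
    print(reliability,
          round(corrected_correlation(0.6, reliability), 3),
          perfectly_predicts(0.6, reliability))
```

With an obtained correlation of .6 the squared correlation is .36, so any
reliability value at or below .36 leaves no posttest variance for the
treatment to explain.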

The Simulation Study

A Monte Carlo type simulation study was designed to shed some light on the
relative merit of these four methods of measuring change. The basic population
model was a two group experiment with a linear relationship between the pretest
and posttest. The linearity between pretest and posttest scores was held
constant throughout the sample sets at a correlation of .6. This value permitted
the reliability to be varied over a wide range while still giving a fair amount
of power for the covariate. Analysis of covariance, true score analysis,
analysis of residualized gain, and analysis of difference scores were calculated
for each sample and the number of significant F-tests at four standard levels
was tabulated.

To generate the pairs of pretest and posttest observations, a set of
three random normal deviates was required. The value of the pretest was set
to the first random normal deviate, $e_{1ij}$. The posttest score was then calculated
using the relationship

$$y_{ij} = m + t_j + r\,x_{ij} + \sqrt{1 - r^2}\,e_{2ij} \tag{5}$$

The parameters $m$ and $t_j$ represent the grand mean and the deviation score for
group $j$ respectively. To simplify calculations, $m$ was set equal to zero. The
obtained pretest with errors, $x'_{ij}$, was obtained using the relationship

$$x'_{ij} = \sqrt{R_{xx}}\,x_{ij} + \sqrt{1 - R_{xx}}\,e_{3ij} \tag{6}$$
Every sample had two groups of twenty-five observations and there were
1000 samples in each set. Reliability values, $R_{xx}$, were set at .8, .6, and .5,
while differences between the groups on the dependent variable were set at
.6, .2, .1, and 0.0. The latter, of course, represents a test of the null
hypothesis. Additionally, a few simulations were performed with differences
introduced into the means of both the pretest and posttest groups such that
the predicted values lay on the same regression line. Because the unreliability
was added in a second step, it was possible to obtain an actual score
analysis of covariance by using the $x$ values instead of the $x'$ values. In
some of the later simulations this was done. The four levels for alpha used
were .01, .05, .10, and .25.
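
The null-hypothesis condition of this design can be sketched for the analysis
of covariance alone. Everything below is an illustration under stated
assumptions: the function names and seed are hypothetical, the posttest error
is scaled to give unit variance, and the critical value 4.05 is an approximate
table value for F with 1 and 47 degrees of freedom at the .05 level. The F
statistic compares the residual sum of squares of the model with pretest only
against the model with pretest and group.

```python
import math
import random


def sums(xs, ys):
    """Sums of squares and cross products about the means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def ancova_f(groups):
    """F (1 and N - 3 df) for the group effect in a two-group analysis of
    covariance; `groups` is a pair of (pretest list, posttest list)."""
    n = sum(len(xs) for xs, _ in groups)
    wxx = wyy = wxy = 0.0  # pooled within-group sums
    for xs, ys in groups:
        sxx, syy, sxy = sums(xs, ys)
        wxx += sxx
        wyy += syy
        wxy += sxy
    all_x = [x for xs, _ in groups for x in xs]
    all_y = [y for _, ys in groups for y in ys]
    txx, tyy, txy = sums(all_x, all_y)
    sse_full = wyy - wxy ** 2 / wxx      # model with group and pretest
    sse_reduced = tyy - txy ** 2 / txx   # model with pretest only
    return (sse_reduced - sse_full) / (sse_full / (n - 3))


def null_sample(rng, n=25, r=0.6, reliability=0.6):
    """Two groups with no treatment difference and an unreliable pretest."""
    groups = []
    for _ in range(2):
        xs, ys = [], []
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)
            ys.append(r * x + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0))
            xs.append(math.sqrt(reliability) * x
                      + math.sqrt(1.0 - reliability) * rng.gauss(0.0, 1.0))
        groups.append((xs, ys))
    return groups


F_CRIT_05 = 4.05  # assumed approximate .05 critical value of F(1, 47)

rng = random.Random(0)  # arbitrary seed
n_significant = sum(ancova_f(null_sample(rng)) > F_CRIT_05
                    for _ in range(1000))
print(n_significant)  # about 50 of 1000 are expected by chance
```

With 1000 samples and an alpha of .05, roughly 50 significant results are
expected when the null hypothesis is true, which is the sense of the
"expected number significant" used in the tabulations.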

After reviewing the first few simulations the program was modified to
provide descriptive statistics for each of the methods and to provide an
analysis of covariance on the reliable $x$ scores.

The true score correction was one suggested by Cohen and Cohen (1975).
It requires a least-squares procedure that uses the correlation matrix to
obtain a solution. The raw score correlation matrix is altered by dividing
all correlations involving the unreliable variable by the square root of $R_{xx}$
and multiplying the standard deviation of the unreliable variable by the same
value. The least-squares solution is then found for this new matrix. Inspection
of the $b_2$ weights for the true score analysis indicated the correction may have
been too large relative to the amount of unreliability introduced in the model.
Some additional sample sets were then produced with the size of the unreliability
correction being set to half the amount of unreliability introduced. For example,
if the reliability of $x$ in equation (6) was set at .8, then the appropriate
correlations in the raw score matrix were divided by the square root of .9.
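
For the two-predictor case the adjustment just described can be written out
directly. The sketch below is not Cohen and Cohen's procedure as published; it
assumes the standard closed-form least-squares weights for two predictors, and
the function name and argument order are hypothetical. It applies the
correction to population values for a pretest reliability of .8 with no
treatment effect and no pretest difference between groups.

```python
import math


def true_score_weights(r_ay, r_xy, r_ax, sd_y, sd_a, sd_x, correction):
    """Raw-score weights (b1 for treatment a, b2 for pretest x) from a
    correlation matrix corrected for unreliability in x."""
    root = math.sqrt(correction)
    r_xy_c = r_xy / root   # corrected pretest-posttest correlation
    r_ax_c = r_ax / root   # corrected pretest-treatment correlation
    sd_x_c = sd_x * root   # pretest standard deviation scaled by same value
    denom = 1.0 - r_ax_c ** 2
    beta_a = (r_ay - r_xy_c * r_ax_c) / denom   # standardized weights
    beta_x = (r_xy_c - r_ay * r_ax_c) / denom
    return beta_a * sd_y / sd_a, beta_x * sd_y / sd_x_c


# Population correlation between posttest and an obtained pretest with
# reliability .8, when the error-free correlation is .6.
observed_r_xy = 0.6 * math.sqrt(0.8)

for correction in (1.0, 0.9, 0.8):
    b1, b2 = true_score_weights(0.0, observed_r_xy, 0.0,
                                1.0, 1.0, 1.0, correction)
    print(correction, round(b1, 3), round(b2, 3))
```

Under these assumptions the pretest weight is about .54 with no correction,
about .60 with the half-size correction of .9, and about .67 with the full
correction of .8, so the full correction overshoots the generating value of .6.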

Results

The most interesting aspect of the simulations was the distribution of
Type-I errors. Table 1 presents the average number of significant results for
the various types of analyses. Three parameters are used to make a chi-squared
goodness of fit test for an F distribution (i.e., number of samples, $df_1$, and $df_2$)
so a chi-squared test with one degree of freedom could be conducted on the number
of significant F's. All but one of the true score distributions could be
rejected as a bad fit at p < .01 while none of the covariance, residualized
gain, or difference score distributions could be rejected at this level. This
test clearly indicated the true score F-tests were not following the expected
distribution of Type-I errors. An estimate of the relationship of true score
alpha level to that used in the tests was found by fitting a second degree
polynomial with a zero intercept to the number of significant true score
F-tests as a function of the expected number of significant F-tests. Figure 1
shows the plot for a reliability of .6. With the exception of the p < .01 point,
this curve appears to be a reasonable fit to the values. The half size
reliability corrections were a good match to the corresponding plot for a
reliability of .8. It appears that the primary effect of the true score
correction was to increase the effective alpha level for the F-tests. Table 2
gives the estimated alpha level values for true score correction based on the
polynomial equations.
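
The zero-intercept quadratic fit can be sketched with ordinary least squares.
The function name is hypothetical, and the three data points are an assumed
reading of Table 1: the true score counts 112, 182, and 320 against expected
counts of 50, 100, and 250, which appear to belong to the condition with a
reliability and correction of .6. The p < .01 point is left out because its
entry is not legible in the table.

```python
def fit_zero_intercept_quadratic(xs, ys):
    """Least-squares c1 and c2 for y = c1*x + c2*x**2 (no constant term),
    found by solving the two normal equations directly."""
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    det = s2 * s4 - s3 * s3
    c1 = (sxy * s4 - s3 * sx2y) / det
    c2 = (s2 * sx2y - s3 * sxy) / det
    return c1, c2


expected = [50, 100, 250]    # expected significant results out of 1000
observed = [112, 182, 320]   # assumed reading of the true score counts

c1, c2 = fit_zero_intercept_quadratic(expected, observed)


def effective_alpha(nominal_alpha, n_samples=1000):
    """Estimated true score Type-I error rate at a nominal alpha level."""
    x = nominal_alpha * n_samples
    return (c1 * x + c2 * x * x) / n_samples


print(round(c1, 4), round(c2, 6), round(effective_alpha(0.05), 3))
```

Forcing the curve through the origin reflects the requirement that a test
carried out at an alpha of zero can produce no significant results, whatever
inflation occurs at larger alpha levels.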

When the groups had mean differences on the posttest the true score
analysis did produce more significant F-tests than analysis of covariance.
The difference in alpha levels of the two methods, however, made direct



Table 1

Number of significant F-tests obtained when the
pretest and posttest means have no differences.

Expected              Correc-  Number of
Number       Error    tion     Sample     Covari-  True   Residual  Difference
Significant  in X'             Sets       ance     Score  Gain      Scores

250          .8       .9       3          239      254    238       238
250          .8       .8       1          259      291    258       259
250          .6       .8       3          233      268    234       258
250          .6       .6       2          250      320    250       254
250          .5       .5       1          240      382    239       245

100          .8       .9       3           91      100     89        89
100          .8       .8       1          105      135    105       107
100          .6       .8       3           88      109     90        91
100          .6       .6       2          102      182    101        92
100          .5       .5       1           96      226     97       101

 50          .8       .9       3           --       52     44        43
 50          .8       .8       1           52       76     50        53
 50          .6       .8       3           47       61     47        48
 50          .6       .6       2           53      112     55        48
 50          .5       .5       1           45      152     46        48

 10          .8       .9       3            5       --     --        10
 10          .8       .8       1            9       --      9        12
 10          .6       .8       3           12       --     13         8
 10          .6       .6       2           11       46     10        12
 10          .5       .5       1           10       65     11        11

Note. -- marks an entry that is illegible in the source.
Figure 1

Distribution of significant F-tests
for True Score Analysis and Analysis
of Covariance when the null hypothesis
is true.





comparison impossible. By plotting the number of significant tests as a
function of the estimated alpha level, the power of the true score method could
be compared with analysis of covariance. Figure 2 shows the relative power
for the two types of analyses when the reliability was .6. The analysis of
covariance appears to be slightly more powerful because at nearly every level
of alpha it produces more significant results. This would indicate that the
power of a true score analysis could be obtained more directly by increasing
the alpha level in analysis of covariance.

It is possible that true score analysis might do a better job of recovering
the population parameters of the model. Table 3 gives the mean and standard
deviation for each $b$ weight calculated in the two types of analyses. As
expected, introducing unreliability into the pretest scores caused the covariance
analysis to get a lower value for $b_2$. The true score analysis got a larger
value for $b_2$ when the full correction was introduced and came very close to the
actual value when the half correction was used. Even so, examination of the
means for $b_0$ and $b_1$ shows that the two methods produced almost identical
estimates and that these were also almost identical to the mean weights before
the unreliability was introduced. There was a difference in the standard
deviations of the weights for the two methods. In almost every case, the
true score analysis produced a greater variation in the estimates of each
of the three weights than did the analysis of covariance. In terms of estimating
the critical parameter $b_1$, the true score analysis appears to do no better than
analysis of covariance and in fact may be worse judging by the standard
deviations of the weights.

The one place where the true score analysis did appear to have an edge
was the case when there was a mean difference in both the pretest and posttest
for the groups. Table 4 shows that, when the half unreliability
correction was used, the true score sample sets came closer to a





Table 2

Estimated Type-I error rates for
true score analysis of covariance

Alpha    (True score correction) / (unreliability introduced)
Used     .9/.8    .8/.8    .8/.6    .6/.6    .5/.5

.01      .011     .015     .015     .022     .030
.05      .058     .073     .073     .104     .136
.10      .110     .138     .138     .191     .242
.25      .275     .290     .290     .352     .380
Figure 2

Distribution of significant F-tests
for True Score Analysis (TS) and
Analysis of Covariance (CV) when the
null hypothesis is false.
Table 3(a)

Mean values of sample
regression weights

Relia-  Correc-        b_0                  b_1                  b_2
bility  tion      AS    CV    TS       AS    CV    TS       AS    CV    TS

Y = 0 + .3a + .6x
.6      .8        --    --    --       --    --    --       .599  .463  .584
.6      .6        --    --    --       --    .313  .313     .59?  .456  --

Y = 0 + .1a + .6x
.8      .9        .000  .001  .001     .09?  .096  --       .600  .538  .599
.8      .8        .003  .001  .001     .098  .099  .100     .600  .53?  .675
.6      .8        .009  .00?  .00?     .102  .104  .104     .595  .462  .581
.6      .6        .007  .006  --       .099  .098  .099     .593  .454  .768

Y = 0 + .05a + .6x
.8      .8        .00?  .007  .007     .042  .039  .039     .604  .541  .681
.6      .6        .000  .003  .003     .055  .050  .052     .603  .468  .792

Note. -- marks an entry that is missing or illegible in the source; ? marks an
illegible digit. The first two rows of the first block are missing, and one
further b_1 value for its .6/.8 row (.311) survives without a recoverable column.

Legend: AS  Actual Score Analysis of Covariance
        CV  Analysis of Covariance
        TS  True Score Analysis of Covariance
Table 3(b)

Standard deviation values
for sample regression weights

Relia-  Correc-        b_0                  b_1                  b_2
bility  tion      AS    CV    TS       AS    CV    TS       AS    CV    TS

Y = 0 + .3a + .6x
.8      .8        .112  .119  .119     .118  .122  .123     .120  .130  .145
--      .8        .111  .11?  .121     .115  .122  .125     .114  .123  .154
.6      --        .111  .130  .133     .118  .134  .135     .117  .131  .164
--      .6        .11?  .132  .145     .116  .129  .141     .119  .133  .226

Y = 0 + .1a + .6x
.8      .9        .114  .120  .12?     .111  .119  .120     .114  .120  .133
.8      .8        --    .118  .120     .116  .122  .126     .116  .123  .156
.6      .8        .111  --    .126     .114  .12?  .129     .119  .131  .165
.6      .6        .110  --    .129     .112  .125  .136     .116  .129  .219

Y = 0 + .05a + .6x
--      --        .115  .120  .122     .112  .118  .121     .11?  .125  .158
.6      .6        .113  --    .136     .115  .12?  .139     .118  .129  .220

Note. -- marks an entry that is missing or illegible in the source; ? marks an
illegible digit. Reliability and correction values are given as they read in
the source.

Legend: AS  Actual Score Analysis of Covariance
        CV  Analysis of Covariance
        TS  True Score Analysis of Covariance
null distribution. Too many significant results were still obtained and the
number of these appeared very sensitive to the size of the reliability correction.
The analysis of covariance of the actual scores under these conditions did
result in a null distribution of significant F-tests. This set of samples
substantiates the warning that covariance may not be appropriate for data sets
where there are large group differences in the means of the covariate. As Lord
(1963) points out, one should make every effort to keep this from happening by
techniques such as random assignment of subjects to groups.

While a simulation can never directly answer theoretical questions about
statistical models, it does give important clues as to what the important factors
might be. For instance, why should the true score analysis produce so many
Type-I errors? This probably happens because the addition of dependent variable
information into the true score predictors increases the $R^2$ for the model as
a whole. While most of this increase goes into the $b_2$ component, in some
samples it also gets into the other weights, resulting in an excessive number of
significant findings. There is, of course, no way of knowing when this will
happen. To make matters worse, it is often very difficult to determine just
what the value of $R_{xx}$ should be. Some of our own data indicate that it can
vary by large amounts from group to group, particularly for groups occurring
naturally. It is not clear at all how one incorporates multiple values of
$R_{xx}$ into the reliability correction model.

Very little has been said about the other two methods of analyzing change,
the analysis of difference scores and the residualized gain analysis. In
general, they did just as one would expect from Werts and Linn's paper. The
distribution of Type-I errors paralleled that of analysis of covariance when
the null hypothesis was true and both methods showed less power when it was
not true. The sampling method used apparently produced very little correlation
between the covariate and the group factor because the residualized gain
Table 4

Number of significant F-tests
when the pretest means are different
and the posttest means lie on the
same regression line ($R_{xx}$ = .6)

                                    Expected number significant
Mean
Difference  Correction  Test        10      50      100     250

.6          .8          TS          31      104     175     346
                        CV          48      143     221     410
.6          .8          TS          118     203     266     428
                        CV          44      127     211     409
.2          .6          TS          57      132     207     372
                        CV          1?      57      104     266

Note. ? marks an illegible digit. Mean difference and correction values are
given as they read in the source.


F-distributions were close to covariance in almost every case.

Clearly, the analysis of covariance is the method of choice to control
for individual differences on a posttest measure. To some extent it will
also control for group mean differences on the covariate but there are
problems with this. Again, the researcher is cautioned to make tests for
homogeneity and linearity of regression a standard procedure. The homogeneity
of regression slopes test is especially important when there are mean pretest
differences in the groups.